A performance assessment of relatedness inference methods using genome-wide data from thousands of relatives

Authors

Monica D. Ramstetter

Thomas D. Dyer

Donna M. Lehman

Joanne E. Curran

Ravindranath Duggirala

John Blangero

Jason G. Mezey

Amy L. Williams

Abstract

Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these methods in real data has been lacking. Here, we report an assessment of 11 state-of-the-art relatedness inference methods using a dataset with 2,485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (~93% to 99%) when reporting first and second degree relationships, but their accuracy dwindles to less than 60% for fifth degree relationships. However, the inferred relationships were correct to within one relatedness degree at a rate of 83% to 99% across all methods and considered relationship degrees. Furthermore, most methods infer unrelated individuals correctly at a rate of ~99%, suggesting a low rate of false positives. Overall, the most accurate methods were ERSA 2.0 and approaches that classify relationships using the IBD segments inferred by Refined IBD and IBDseq. Combining results from the most accurate methods provides little accuracy improvement, indicating that novel approaches for relatedness inference may be needed to achieve a sizeable jump in performance.

The recent explosive growth in sample sizes of genetic datasets has led to an increasing proportion of close relatives hidden within these large studies, necessitating relatedness detection. Inferring relatedness between samples is an essential step in performing genetic association studies and linkage analysis, is a powerful tool for forensic genetics, and is needed to account for or remove relatives in population genetic analyses. Relatedness estimation has also drawn the interest of the general public via companies such as 23andMe and AncestryDNA, which advertise their ability to find and report relatives, allowing individuals to explore their ancestry and genealogy.
The broad utility of relatedness estimation has motivated the development of numerous methods for such inference. These methods work by estimating the proportion of the genome shared identical by descent (IBD) between individuals, or a closely related quantity, where an allele in two or more individuals' genomes is said to be IBD if those individuals inherit it from a recent common ancestor. As previously shown, the distributions of IBD proportions for different relatedness classes (such as first cousins and half-first cousins) are expected to overlap, posing a challenge for these inference procedures. Here, we present a rigorous evaluation of 11 state-of-the-art methods that can scale to large study sizes, including seven that directly infer genome-wide relatedness measures and four IBD segment detection methods that we utilized to infer these quantities. To assess each of these methods, we used SNP array genotypes from Mexican American individuals contained in large pedigrees from the San Antonio Mexican American Family Studies (SAMAFS). Our analysis sample included 2,485 individuals genotyped at 521,184 SNPs (Supplemental Note) within pedigrees that span up to six generations with genotype data ...

*Correspondence: [email protected] (M.D.R.), [email protected] (A.L.W.)

bioRxiv preprint first posted online Feb. 4, 2017; doi: http://dx.doi.org/10.1101/106013. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
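As a rough illustration of how a genome-wide sharing estimate can be turned into a relatedness degree, and of the two accuracy measures quoted in the abstract (exact degree and within one degree), here is a minimal Python sketch. It is not the procedure used by any of the evaluated methods. It assumes the input is a pairwise kinship coefficient (one of the "closely related quantities" to the IBD proportion), that the expected kinship for degree d is 2^-(d+1), and that class boundaries sit at 2^-(d+1.5), a common power-of-two convention. The function names `infer_degree` and `degree_accuracy` and the example values are hypothetical.

```python
import math
from typing import Optional, Sequence


def infer_degree(kinship: float, max_degree: int = 5) -> Optional[int]:
    """Map a kinship coefficient to a relatedness degree.

    Assumed convention: degree d covers kinship in
    (2^-(d+1.5), 2^-(d+0.5)]. Degree 0 stands for duplicates or
    monozygotic twins. Returns None (treated as unrelated) when the
    estimate is non-positive or falls beyond max_degree.
    """
    if kinship <= 0:
        return None
    degree = max(0, math.floor(-math.log2(kinship) - 0.5))
    return degree if degree <= max_degree else None


def degree_accuracy(true_degrees: Sequence[int],
                    inferred: Sequence[Optional[int]],
                    tolerance: int = 0) -> float:
    """Fraction of pairs whose inferred degree is within `tolerance`
    of the pedigree-reported degree (pairs inferred as unrelated count as misses)."""
    hits = sum(
        1 for t, p in zip(true_degrees, inferred)
        if p is not None and abs(t - p) <= tolerance
    )
    return hits / len(true_degrees)


# Hypothetical example: four pairs with known pedigree degrees.
true_degrees = [1, 2, 3, 5]
kinships = [0.25, 0.12, 0.03, 0.01]           # made-up estimates
inferred = [infer_degree(k) for k in kinships]  # [1, 2, 4, None]

print(degree_accuracy(true_degrees, inferred))               # 0.5
print(degree_accuracy(true_degrees, inferred, tolerance=1))  # 0.75
```

Because the sharing distributions of neighbouring classes overlap, fixed cutoffs like these misassign some true relatives to an adjacent degree, which is why the within-one-degree rate is higher than the exact rate in this toy example.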

Similar resources

Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives

Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures

Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computational methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...

Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium.

Estimates of relatedness have several applications, such as the identification of relatives or the identification of disease-related genes through identity by descent (IBD) mapping. Here we present a new method for identifying IBD tracts among individuals from genome-wide single nucleotide polymorphism data. We use a continuous time Markov model where the hidden states are the number of alleles shared...

Inference of Relationships in Population Data Using Identity-by-Descent and Identity-by-State

It is an assumption of large, population-based datasets that samples are annotated accurately whether they correspond to known relationships or unrelated individuals. These annotations are key for a broad range of genetics applications. While many methods are available to assess relatedness that involve estimates of identity-by-descent (IBD) and/or identity-by-state (IBS) allele-sharing proport...

Publication date: 2017
